{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<h1> Serving embeddings </h1>\n",
    "\n",
    "This notebook illustrates how to:\n",
    "<ol>\n",
    "<li> Create a custom embedding as part of a regression/classification model\n",
    "<li> Representing categorical variables in different ways\n",
    "<li> Math with feature columns\n",
    "<li> Serve out the embedding, as well as the original model's predictions\n",
    "</ol>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# change these to try this notebook out\n",
    "BUCKET = 'cloud-training-demos-ml'\n",
    "PROJECT = 'cloud-training-demos'\n",
    "REGION = 'us-central1'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ['BUCKET'] = BUCKET\n",
    "os.environ['PROJECT'] = PROJECT\n",
    "os.environ['REGION'] = REGION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Updated property [core/project].\n",
      "Updated property [compute/region].\n"
     ]
    }
   ],
   "source": [
    "%bash\n",
    "gcloud config set project $PROJECT\n",
    "gcloud config set compute/region $REGION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "if ! gcloud storage ls | grep -q gs://${BUCKET}/; then\n",    "  gcloud storage buckets create --location=${REGION} gs://${BUCKET}\n",    "fi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Creating dataset\n",
    "\n",
    "The problem is to estimate demand for bicycles at different rental stations in New York City.  The necessary data is in BigQuery:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "query = \"\"\"\n",
    "#standardsql\n",
    "WITH bicycle_rentals AS (\n",
    "  SELECT\n",
    "    COUNT(starttime) as num_trips,\n",
    "    EXTRACT(DATE from starttime) as trip_date,\n",
    "    MAX(EXTRACT(DAYOFWEEK from starttime)) as day_of_week,\n",
    "    start_station_id\n",
    "  FROM `bigquery-public-data.new_york.citibike_trips`\n",
    "  GROUP BY trip_date, start_station_id\n",
    "),\n",
    "\n",
    "rainy_days AS\n",
    "(\n",
    "SELECT\n",
    "  date,\n",
    "  (MAX(prcp) > 5) AS rainy\n",
    "FROM (\n",
    "  SELECT\n",
    "    wx.date AS date,\n",
    "    IF (wx.element = 'PRCP', wx.value/10, NULL) AS prcp\n",
    "  FROM\n",
    "    `bigquery-public-data.ghcn_d.ghcnd_2016` AS wx\n",
    "  WHERE\n",
    "    wx.id = 'USW00094728'\n",
    ")\n",
    "GROUP BY\n",
    "  date\n",
    ")\n",
    "\n",
    "SELECT\n",
    "  num_trips,\n",
    "  day_of_week,\n",
    "  start_station_id,\n",
    "  rainy\n",
    "FROM bicycle_rentals AS bk\n",
    "JOIN rainy_days AS wx\n",
    "ON wx.date = bk.trip_date\n",
    "\"\"\"\n",
    "import google.datalab.bigquery as bq\n",
    "df = bq.Query(query).execute().result().to_dataframe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>num_trips</th>\n",
       "      <th>day_of_week</th>\n",
       "      <th>start_station_id</th>\n",
       "      <th>rainy</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>71373</th>\n",
       "      <td>11</td>\n",
       "      <td>4</td>\n",
       "      <td>3050</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>53403</th>\n",
       "      <td>339</td>\n",
       "      <td>3</td>\n",
       "      <td>497</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36856</th>\n",
       "      <td>114</td>\n",
       "      <td>3</td>\n",
       "      <td>259</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>83250</th>\n",
       "      <td>95</td>\n",
       "      <td>5</td>\n",
       "      <td>377</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>56575</th>\n",
       "      <td>35</td>\n",
       "      <td>4</td>\n",
       "      <td>270</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       num_trips  day_of_week  start_station_id  rainy\n",
       "71373         11            4              3050  False\n",
       "53403        339            3               497   True\n",
       "36856        114            3               259  False\n",
       "83250         95            5               377  False\n",
       "56575         35            4               270  False"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# shuffle the dataframe to make it easier to split into train/eval later\n",
    "df = df.sample(frac=1.0)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Feature engineering\n",
    "Let's build a model to predict the number of trips that start at each station, given that we know the day of the week and whether it is a rainy day.\n",
    "\n",
    "Inputs to the model:\n",
    "* day of week (integerized, since it is 1-7)\n",
    "* station id (hash buckets, since we don't know full vocabulary. The dataset has about 650 unique values. we'll use a much larger hash bucket size, but then embed it into a lower dimension)\n",
    "* rainy (true/false)\n",
    "\n",
    "Label:\n",
    "* num_trips\n",
    "\n",
    "By embedding the station id into just 2 dimensions, we will also get to learn which stations are like each other, at least in the context of rainy-day-rentals."
   ]
  },
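  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To build intuition for the hash bucketing above: each station id is deterministically mapped into one of 5000 buckets, so ids never seen in training still land in some bucket (at the cost of occasional collisions). Here is a minimal sketch of the idea; note that it uses Python's built-in hash() purely for illustration, whereas TensorFlow uses a stable fingerprint function, so the bucket numbers will differ:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustration only: TensorFlow's hash bucketing uses a stable fingerprint,\n",
    "# not Python's hash(), so the bucket numbers here will differ from TF's.\n",
    "def bucketize(station_id, num_buckets=5000):\n",
    "  return hash(str(station_id)) % num_buckets\n",
    "\n",
    "print([bucketize(s) for s in [435, 521, 3221, 3237]])"
   ]
  },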
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "### Change data type\n",
    "\n",
    "Let's change the Pandas data types to more efficient (for TensorFlow) forms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "num_trips           int64\n",
       "day_of_week         int64\n",
       "start_station_id    int64\n",
       "rainy                bool\n",
       "dtype: object"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "num_trips           float32\n",
       "day_of_week           int32\n",
       "start_station_id      int32\n",
       "rainy                object\n",
       "dtype: object"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "df = df.astype({'num_trips': np.float32, 'day_of_week': np.int32, 'start_station_id': np.int32, 'rainy': str})\n",
    "df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "### Scale the label to make it easier to optimize."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "df['num_trips'] = df['num_trips'] / 1000.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Split into 104148 training examples and 26037 evaluation examples\n"
     ]
    }
   ],
   "source": [
    "num_train = (int) (len(df) * 0.8)\n",
    "train_df = df.iloc[:num_train]\n",
    "eval_df  = df.iloc[num_train:]\n",
    "print(\"Split into {} training examples and {} evaluation examples\".format(len(train_df), len(eval_df)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>num_trips</th>\n",
       "      <th>day_of_week</th>\n",
       "      <th>start_station_id</th>\n",
       "      <th>rainy</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>71373</th>\n",
       "      <td>0.011</td>\n",
       "      <td>4</td>\n",
       "      <td>3050</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>53403</th>\n",
       "      <td>0.339</td>\n",
       "      <td>3</td>\n",
       "      <td>497</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36856</th>\n",
       "      <td>0.114</td>\n",
       "      <td>3</td>\n",
       "      <td>259</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>83250</th>\n",
       "      <td>0.095</td>\n",
       "      <td>5</td>\n",
       "      <td>377</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>56575</th>\n",
       "      <td>0.035</td>\n",
       "      <td>4</td>\n",
       "      <td>270</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       num_trips  day_of_week  start_station_id  rainy\n",
       "71373      0.011            4              3050  False\n",
       "53403      0.339            3               497   True\n",
       "36856      0.114            3               259  False\n",
       "83250      0.095            5               377  False\n",
       "56575      0.035            4               270  False"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<h2> Creating an Estimator model </h2>\n",
    "\n",
    "Pretty minimal, but it works!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "import pandas as pd\n",
    "\n",
    "def make_input_fn(indf, num_epochs):\n",
    "  return tf.estimator.inputs.pandas_input_fn(\n",
    "    indf,\n",
    "    indf['num_trips'],\n",
    "    num_epochs=num_epochs,\n",
    "    shuffle=True)\n",
    "\n",
    "def serving_input_fn():\n",
    "    feature_placeholders = {\n",
    "      'day_of_week': tf.placeholder(tf.int32, [None]),\n",
    "      'start_station_id': tf.placeholder(tf.int32, [None]),\n",
    "      'rainy': tf.placeholder(tf.string, [None])\n",
    "    }\n",
    "    features = {\n",
    "        key: tf.expand_dims(tensor, -1)\n",
    "        for key, tensor in feature_placeholders.items()\n",
    "    }\n",
    "    return tf.estimator.export.ServingInputReceiver(features, feature_placeholders)\n",
    "  \n",
    "def train_and_evaluate(output_dir, nsteps):\n",
    "  station_embed = tf.feature_column.embedding_column(\n",
    "      tf.feature_column.categorical_column_with_hash_bucket('start_station_id', 5000, tf.int32), 2)\n",
    "  feature_cols = [\n",
    "    tf.feature_column.categorical_column_with_identity('day_of_week', num_buckets = 8),\n",
    "    station_embed,\n",
    "    tf.feature_column.categorical_column_with_vocabulary_list('rainy', ['false', 'true'])\n",
    "  ]\n",
    "  estimator = tf.estimator.LinearRegressor(\n",
    "                       model_dir = output_dir,\n",
    "                       feature_columns = feature_cols)\n",
    "  train_spec=tf.estimator.TrainSpec(\n",
    "                       input_fn = make_input_fn(train_df, None),\n",
    "                       max_steps = nsteps)\n",
    "  exporter = tf.estimator.LatestExporter('exporter', serving_input_fn)\n",
    "  eval_spec=tf.estimator.EvalSpec(\n",
    "                       input_fn = make_input_fn(eval_df, 1),\n",
    "                       steps = None,\n",
    "                       start_delay_secs = 1, # start evaluating after N seconds\n",
    "                       throttle_secs = 10,  # evaluate every N seconds\n",
    "                       exporters = exporter)\n",
    "  tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)\n",
    "  \n",
    "import shutil\n",
    "OUTDIR='./model_trained'\n",
    "shutil.rmtree(OUTDIR, ignore_errors=True)\n",
    "train_and_evaluate(OUTDIR, 10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Predict using the exported model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting test.json\n"
     ]
    }
   ],
   "source": [
    "%writefile test.json\n",
    "{\"day_of_week\": 3, \"start_station_id\": 384, \"rainy\": \"false\"}\n",
    "{\"day_of_week\": 4, \"start_station_id\": 384, \"rainy\": \"true\"}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "PREDICTIONS\n",
      "[0.09803415834903717]\n",
      "[0.07751345634460449]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING: /usr/local/envs/py2env/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
      "  from ._conv import register_converters as _register_converters\n",
      "2018-07-17 18:49:13.727528: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA\n",
      "\n"
     ]
    }
   ],
   "source": [
    "%bash\n",
    "EXPORTDIR=./model_trained/export/exporter/\n",
    "MODELDIR=$(ls $EXPORTDIR | tail -1)\n",
    "gcloud ml-engine local predict --model-dir=${EXPORTDIR}/${MODELDIR} --json-instances=./test.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Serving out the embedding also\n",
    "\n",
    "To serve out the embedding, we need to use a model function (a custom estimator) so that we have access to output_alternates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "import pandas as pd\n",
    "\n",
    "def make_input_fn(indf, num_epochs):\n",
    "  return tf.estimator.inputs.pandas_input_fn(\n",
    "    indf,\n",
    "    indf['num_trips'],\n",
    "    num_epochs=num_epochs,\n",
    "    shuffle=True)\n",
    "\n",
    "def serving_input_fn():\n",
    "    feature_placeholders = {\n",
    "      'day_of_week': tf.placeholder(tf.int32, [None]),\n",
    "      'start_station_id': tf.placeholder(tf.int32, [None]),\n",
    "      'rainy': tf.placeholder(tf.string, [None])\n",
    "    }\n",
    "    features = {\n",
    "        key: tf.expand_dims(tensor, -1)\n",
    "        for key, tensor in feature_placeholders.items()\n",
    "    }\n",
    "    return tf.estimator.export.ServingInputReceiver(features, feature_placeholders)\n",
    "\n",
    "def model_fn(features, labels, mode):\n",
    "  # linear model\n",
    "  station_col = tf.feature_column.categorical_column_with_hash_bucket('start_station_id', 5000, tf.int32)\n",
    "  station_embed = tf.feature_column.embedding_column(station_col, 2)  # embed dimension\n",
    "  embed_layer = tf.feature_column.input_layer(features, station_embed)\n",
    "  \n",
    "  cat_cols = [\n",
    "    tf.feature_column.categorical_column_with_identity('day_of_week', num_buckets = 8),\n",
    "    tf.feature_column.categorical_column_with_vocabulary_list('rainy', ['false', 'true'])\n",
    "  ]\n",
    "  cat_cols = [tf.feature_column.indicator_column(col) for col in cat_cols]\n",
    "  other_inputs = tf.feature_column.input_layer(features, cat_cols)\n",
    "  \n",
    "  all_inputs = tf.concat([embed_layer, other_inputs], axis=1)\n",
    "  predictions = tf.layers.dense(all_inputs, 1)  # linear model\n",
    "  \n",
    "  # 2. Use a regression head to use the standard loss, output, etc.\n",
    "  my_head = tf.contrib.estimator.regression_head()\n",
    "  spec = my_head.create_estimator_spec(\n",
    "    features = features, mode = mode, labels = labels, logits = predictions,\n",
    "    optimizer = tf.train.FtrlOptimizer(learning_rate = 0.1)\n",
    "  )\n",
    "  \n",
    "  # 3. Create predictions\n",
    "  predictions_dict = {\n",
    "    \"predicted\": predictions,\n",
    "    \"station_embed\": embed_layer\n",
    "  }\n",
    "    \n",
    "  # 4. Create export outputs\n",
    "  export_outputs = {\n",
    "    \"predict_export_outputs\": tf.estimator.export.PredictOutput(outputs = predictions_dict)\n",
    "  }\n",
    "\n",
    "  # 5. Return EstimatorSpec after modifying the predictions and export outputs\n",
    "  return spec._replace(predictions = predictions_dict, export_outputs = export_outputs)\n",
    "\n",
    "def train_and_evaluate(output_dir, nsteps):\n",
    "  estimator = tf.estimator.Estimator(\n",
    "                       model_fn = model_fn,\n",
    "                       model_dir = output_dir)\n",
    "  train_spec=tf.estimator.TrainSpec(\n",
    "                       input_fn = make_input_fn(train_df, None),\n",
    "                       max_steps = nsteps)\n",
    "  exporter = tf.estimator.LatestExporter('exporter', serving_input_fn)\n",
    "  eval_spec=tf.estimator.EvalSpec(\n",
    "                       input_fn = make_input_fn(eval_df, 1),\n",
    "                       steps = None,\n",
    "                       start_delay_secs = 1, # start evaluating after N seconds\n",
    "                       throttle_secs = 10,  # evaluate every N seconds\n",
    "                       exporters = exporter)\n",
    "  tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)\n",
    "  \n",
    "import shutil\n",
    "OUTDIR='./model_trained'\n",
    "shutil.rmtree(OUTDIR, ignore_errors=True)\n",
    "train_and_evaluate(OUTDIR, 1000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "PREDICTED              STATION_EMBED\n",
      "[0.08442232012748718]  [0.0008054960635490716, 0.0008597194100730121]\n",
      "[0.0879446491599083]   [0.0008054960635490716, 0.0008597194100730121]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING: /usr/local/envs/py2env/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
      "  from ._conv import register_converters as _register_converters\n",
      "2018-07-17 16:28:14.124627: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA\n",
      "\n"
     ]
    }
   ],
   "source": [
    "%bash\n",
    "EXPORTDIR=./model_trained/export/exporter/\n",
    "MODELDIR=$(ls $EXPORTDIR | tail -1)\n",
    "gcloud ml-engine local predict --model-dir=${EXPORTDIR}/${MODELDIR} --json-instances=./test.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Explore embeddings\n",
    "\n",
    "Let's explore the embeddings for some stations. Let's look at stations with overall similar numbers of trips. Do they have similar embedding values?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "stations=\"\"\"\n",
    "  SELECT\n",
    "    COUNT(starttime) as num_trips,\n",
    "    MAX(start_station_name) AS station_name,\n",
    "    start_station_id\n",
    "  FROM `bigquery-public-data.new_york.citibike_trips`\n",
    "  GROUP BY start_station_id\n",
    "  ORDER BY num_trips desc\n",
    "\"\"\"\n",
    "stationsdf = bq.Query(stations).execute().result().to_dataframe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>num_trips</th>\n",
       "      <th>station_name</th>\n",
       "      <th>start_station_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>359182</td>\n",
       "      <td>Pershing Square North</td>\n",
       "      <td>519</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>291615</td>\n",
       "      <td>E 17 St &amp; Broadway</td>\n",
       "      <td>497</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>277060</td>\n",
       "      <td>Lafayette St &amp; E 8 St</td>\n",
       "      <td>293</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>275348</td>\n",
       "      <td>W 21 St &amp; 6 Ave</td>\n",
       "      <td>435</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>268807</td>\n",
       "      <td>8 Ave &amp; W 31 St N</td>\n",
       "      <td>521</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   num_trips           station_name  start_station_id\n",
       "0     359182  Pershing Square North               519\n",
       "1     291615     E 17 St & Broadway               497\n",
       "2     277060  Lafayette St & E 8 St               293\n",
       "3     275348        W 21 St & 6 Ave               435\n",
       "4     268807      8 Ave & W 31 St N               521"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "stationsdf.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>num_trips</th>\n",
       "      <th>station_name</th>\n",
       "      <th>start_station_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>500</th>\n",
       "      <td>2828</td>\n",
       "      <td>W 87 St &amp; West End Ave</td>\n",
       "      <td>3287</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>501</th>\n",
       "      <td>2808</td>\n",
       "      <td>47 Ave &amp; 31 St</td>\n",
       "      <td>3221</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>502</th>\n",
       "      <td>2785</td>\n",
       "      <td>21 St &amp; 41 Ave</td>\n",
       "      <td>3237</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>503</th>\n",
       "      <td>2734</td>\n",
       "      <td>Hanson Pl &amp; Ashland Pl</td>\n",
       "      <td>3429</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>504</th>\n",
       "      <td>2678</td>\n",
       "      <td>E 67 St &amp; Park Ave</td>\n",
       "      <td>3133</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     num_trips            station_name  start_station_id\n",
       "500       2828  W 87 St & West End Ave              3287\n",
       "501       2808          47 Ave & 31 St              3221\n",
       "502       2785          21 St & 41 Ave              3237\n",
       "503       2734  Hanson Pl & Ashland Pl              3429\n",
       "504       2678      E 67 St & Park Ave              3133"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "stationsdf[500:505]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting test.json\n"
     ]
    }
   ],
   "source": [
    "%writefile test.json\n",
    "{\"day_of_week\": 4, \"start_station_id\": 435, \"rainy\": \"true\"}\n",
    "{\"day_of_week\": 4, \"start_station_id\": 521, \"rainy\": \"true\"}\n",
    "{\"day_of_week\": 4, \"start_station_id\": 3221, \"rainy\": \"true\"}\n",
    "{\"day_of_week\": 4, \"start_station_id\": 3237, \"rainy\": \"true\"}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "435 and 521 are in the first list (of top rental locations) and in the Chelsea Market area.\n",
    "3221 and 3237 are in the second list (of rare rentals) and in Long Island.\n",
    "Do the embeddings reflect this?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "PREDICTED              STATION_EMBED\n",
      "[0.08976395428180695]  [-2.3062024411046878e-05, 0.008066227659583092]\n",
      "[0.08778293430805206]  [-3.2270579595206073e-06, 0.0011660171439871192]\n",
      "[0.08674221485853195]  [1.1034319868485909e-05, -0.00245896028354764]\n",
      "[0.08657103031873703]  [9.333280104328878e-06, -0.0030552244279533625]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING: /usr/local/envs/py2env/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
      "  from ._conv import register_converters as _register_converters\n",
      "2018-07-10 14:11:42.625068: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA\n",
      "\n"
     ]
    }
   ],
   "source": [
    "%bash\n",
    "EXPORTDIR=./model_trained/export/exporter/\n",
    "MODELDIR=$(ls $EXPORTDIR | tail -1)\n",
    "gcloud ml-engine local predict --model-dir=${EXPORTDIR}/${MODELDIR} --json-instances=./test.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "In this case, the first dimension of the embedding is almost zero in all cases. So, we only need a one dimensional embedding. And in that, it is quite clear that the Manhattan, frequent rental stations have positive values (0.0081, 0.0011) whereas the Long Island, rare rental stations have negative values (-0.0025, -0.0031)."
   ]
  },
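  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can quantify this separation by computing pairwise distances between the four embedding vectors printed above (the values below are copied from that prediction output):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "# embedding vectors copied from the gcloud ml-engine local predict output above\n",
    "embeds = {\n",
    "  435:  np.array([-2.3062024411046878e-05, 0.008066227659583092]),\n",
    "  521:  np.array([-3.2270579595206073e-06, 0.0011660171439871192]),\n",
    "  3221: np.array([1.1034319868485909e-05, -0.00245896028354764]),\n",
    "  3237: np.array([9.333280104328878e-06, -0.0030552244279533625])\n",
    "}\n",
    "stations = sorted(embeds.keys())\n",
    "for i, a in enumerate(stations):\n",
    "  for b in stations[i+1:]:\n",
    "    print('{} vs {}: {:.4f}'.format(a, b, np.linalg.norm(embeds[a] - embeds[b])))"
   ]
  },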
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Copyright 2018 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
